Assignment 2

Time Series Learning

Inbal Bitton and Shachar Meretz

In this part of the assignment we present an analysis of a data set containing measurements and indices recorded for a number of subjects who were asked to perform a particular protocol of physical activities. Our goal is to use the data collected from the subjects to classify which activity was performed.

We now define a number of functions that we will use throughout the assignment. In several of the steps we load the training set, containing subjects 101, 102, 103, 104, 105, 106 and 109, and the test set, containing subjects 107 and 108.

In the pre-processing of the data we remove the records labelled with activity 0, as detailed in the task description file: this label appears when there is no indication of which activity the subject was asked to perform, so we drop these records to keep them out of the learning process. In addition we must handle missing values, which we fill with the mean value of each measure.
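A minimal sketch of this pre-processing step (the toy frame and its values below are illustrative assumptions, not the actual data set; only the column names follow the report):

```python
import pandas as pd
import numpy as np

# Hypothetical toy frame standing in for one subject's raw records.
df = pd.DataFrame({
    "Activity ID": [0, 1, 1, 2, 2],
    "Heart Rate":  [np.nan, 80.0, np.nan, 100.0, 110.0],
})

# Drop activity 0 (no indication of the performed activity) so it never
# enters the learning process.
df = df[df["Activity ID"] != 0].reset_index(drop=True)

# Fill each measure's missing values with that measure's mean.
df["Heart Rate"] = df["Heart Rate"].fillna(df["Heart Rate"].mean())
```

In the real pipeline the same fill is applied to every numeric measure, per subject.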

In the pre-processing for the LSTM model we again fill missing values and remove activity 0, and in addition normalize the values to the range 0-1. This helps with the memory constraints of the operations we need for this model, and learning is faster and more efficient with values in this range.
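The 0-1 normalization can be sketched as a min-max scaling per column (the `eps` guard and the toy values are assumptions; in practice the min and max are taken from the training set only and reused on the test set):

```python
import numpy as np

def min_max_scale(X, eps=1e-8):
    """Scale each column of X to the [0, 1] range."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo + eps), (lo, hi)

X_train = np.array([[0.0, 10.0],
                    [5.0, 20.0],
                    [10.0, 30.0]])
X_scaled, (lo, hi) = min_max_scale(X_train)   # every value now in [0, 1]
```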

Loading the Activities Table - for each Activity ID we can look up its name

Section 1

Our Data Set

The data set contains data from 3 sensors worn by 9 subjects - 8 male and 1 female. The subjects were asked to perform activities from a list of 12 activities listed below. The three sensors are located on the hand, on the center of mass (chest) and on the ankle. The sensors report a number of measures at a sampling interval of 0.01 seconds, and in addition to the sensors' measures the subject's heart rate is recorded at each timestamp.

Since this is a short sampling interval and the metrics come from sensors, we have to deal with missing values. We fill these missing values for each subject with the mean value of that measure.

The problem presented here is a classification problem over time series: for a given segment of time we want to determine, from the measures presented in the data set, which physical activity the person performed. We therefore prepare and study the data in a generic way rather than per subject.

We prepare the data by handling missing values, deleting records that have no activity label (activity ID 0) and consolidating the subjects' records into one large table.

Our training set was built from subjects 101, 102, 103, 104, 105, 106 and 109 - we will split this set into train and validation.

And our test set was built from subjects 107 and 108.

For each data set we add 2 columns that help us prepare the data for learning: a column for the subject's name, and a column, "sec index", that indicates continuous execution of a particular activity.
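One way to compute such a "sec index" column (an assumption about its exact meaning: a running index that increments every time the activity changes, so that each contiguous run of one activity shares an index):

```python
import pandas as pd

# Toy activity sequence; each contiguous run of one activity gets its own index.
df = pd.DataFrame({"Activity ID": [1, 1, 2, 2, 2, 1]})
df["sec index"] = (df["Activity ID"] != df["Activity ID"].shift()).cumsum()
```

The comparison with the shifted column is True exactly at the start of each run, so the cumulative sum labels the runs 1, 2, 3, ...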

In addition, the "Activity ID" and "Time Stamp" columns will not be features in the learning process.

Here we can see the number of missing values we have for the various measures before we process and prepare the data.

We want to fill in the missing data and remove columns that are not relevant or that we do not want to keep because of their memory footprint.

We now process and prepare the training set and delete the last four columns of each IMU sensor; according to the documentation of the data, these columns are not relevant to this problem.

The columns we remove, for each IMU, are "Orientation_X", "Orientation_Y", "Orientation_Z" and "Orientation_W".
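A small sketch of this column drop (the prefixed column names below are an assumed layout, used only for illustration):

```python
import pandas as pd

# Hypothetical column layout: each IMU's columns carry its location as a prefix.
df = pd.DataFrame(columns=[
    "Heart Rate",
    "Hand_Orientation_X", "Hand_Orientation_Y",
    "Hand_Orientation_Z", "Hand_Orientation_W",
    "Hand_Acc_X",
])

# Drop every orientation column, for every IMU at once.
orientation_cols = [c for c in df.columns if "Orientation" in c]
df = df.drop(columns=orientation_cols)
```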

The number of features in the train set - the "channels" used later - is 40 (not including the columns "Name", "Activity ID", "sec index" and "Time Stamp").

The same process will be performed on the test set as well

Number of missing values

Exploratory Data Analysis

Here we can see how many seconds each subject was measured performing each activity, and the total measured time per subject and per activity.

In the train set we can see that subject 109 has very few samples compared to the other subjects, and there are several activities for which we have a very small number of records - watching TV, car driving, etc.

In the test set we can see that subjects 107 and 108 have a large number of samples, but for several activities we have no samples at all.

Dimensions and Shape of the Train Set and the Test Set

All 40 features (not including 'Activity ID', 'Time Stamp', 'sec index' and 'Name') will be channels in the learning process.

Graphs and Data Analysis

Samples For Each Activity

The entropy value we obtain shows that the data is relatively balanced: the maximum possible entropy is log base 2 of the number of activities, i.e. for 18 activities approximately 4.2 bits are needed to describe the information, and the closer our entropy is to this maximum, the more uniform the distribution of records per activity.

Despite this, there are a few activities for which we do not have many examples - playing soccer, rope jumping, car driving.

Samples For Each Subject

Here too, the graph and the entropy value we obtained show that the uncertainty is high and the number of records per subject is distributed relatively uniformly, so the data can be considered balanced - except for subjects 109 and 103, which have few samples compared to the other subjects.

Mean Temperature For Each Subject

It can be seen that for almost every subject the highest temperature was measured by the sensor on the chest, at the center of mass, except for subject 105, whose hand and ankle temperatures show a noticeably different pattern.

Mean Heart Rate For Each Subject

Mean Temperature For Each Activity

Here too it can be seen that for every activity the highest temperature is measured by the chest sensor. In addition it is interesting that, for all three sensors, the average temperature decreases during aerobic activities - running, cycling, rope jumping, etc. - while it increases during activities that are not perceived as demanding.

Mean Heart Rate For Each Activity

Here it can be seen that, in terms of heart rate, the more aerobic the activity, the higher the heart rate.

Correlation Graphs

Here we present some graphs that help us find patterns based on relationships between different measures; we mainly present correlations between measures obtained from the chest sensor.

In these graphs we can see that, for the acceleration at the center of mass, as we approach values 0 and 40 on the corresponding axes we move between 3 main activities - soccer, rope jumping and running - aerobic activities in which we expect changes in the center-of-mass acceleration.

Here too it can be seen that, for the gyroscope at the center of mass, values near (0, 0) occur mostly in aerobic activities that involve movement - soccer, for example; and for rope jumping and running a small cluster of points can also be seen, mostly around -2.5 to 0 on the Y axis and 5 on the X axis.

For the magnetometer values measured at the center of mass, more significant patterns can be seen.

Mean 3D Acceleration Data For Each IMU

In the following graphs we can see the behavior of the different vectors for all the IMU sensors of the subjects according to the different activities measured

Hand

Chest

Ankle

Self-Supervised Tasks

As we learned in the lecture, it is difficult to obtain data that already has classifications and labels, and it is also difficult to train a model in an unsupervised way. To make the learning process more effective and start from a better starting point, it is possible to perform self-supervised tasks.

A few examples, which we return to later in this report:

- Pre-training the model to predict one feature (e.g. the temperature measured on the hand) from the other features at the same time point.
- Predicting a sensor value (e.g. the chest X-axis acceleration) from a short window of the preceding time points.

Section 2

Validation and Training strategy

Our split strategy is to leave a number of subjects out as a separate validation group - we test with one subject left out, following the leave-one-out strategy. Each such split of the subjects into validation and training sets constitutes one fold.
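The fold construction can be sketched as a leave-one-subject-out loop over the training subjects (keeping subjects 101 and 109 in every training split, for the reasons explained next):

```python
# Training subjects as given in the report; 101 and 109 are never used
# for validation because only they perform some of the activities.
train_subjects = [101, 102, 103, 104, 105, 106, 109]
always_train = {101, 109}

# Each fold: (training subjects, validation subject held out).
folds = [
    (sorted(set(train_subjects) - {s}), [s])
    for s in train_subjects
    if s not in always_train
]
```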

We will not include subject 109 in the validation group because subject 109 performs activities that most of the other subjects do not perform, and therefore we want to learn from him rather than test on him.

The same applies to subject 101 - we want to learn from him and not test on him, because there are activities that only he performed.

For the naive model and the solid baseline we average the different measures over 200 consecutive time points belonging to the same subject and the same activity. That is, we obtain a new record containing the average value of each measure.
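This block averaging can be sketched with a grouped cumulative count (the toy frame is an illustrative assumption; only the column names follow the report):

```python
import pandas as pd
import numpy as np

# Toy frame: one subject, one activity, 400 consecutive samples.
n = 400
df = pd.DataFrame({
    "Name": ["101"] * n,
    "Activity ID": [1] * n,
    "Heart Rate": np.arange(n, dtype=float),
})

# Number the rows within each (subject, activity) group and bin them
# into blocks of 200 consecutive time points.
df["block"] = df.groupby(["Name", "Activity ID"]).cumcount() // 200

# One averaged record per 200-point block.
agg = df.groupby(["Name", "Activity ID", "block"],
                 as_index=False).mean(numeric_only=True)
```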

Therefore, we will separate the preparation of the information for each task:

Naive Base Line

Our naive model applies a decision rule based on the heart rate during an activity. As the graphs showed, each activity has an average heart rate; we use these values as thresholds to decide, for a given time segment with a given average heart rate, which activity is expected. We create a bin for each activity, assign the validation group to these bins, and in this way classify which activity each subject performed in each 2-second time frame.
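One way to realize these bins is a nearest-mean rule (a sketch, not the exact implementation; the activity names and heart-rate values below are made up for illustration):

```python
# Hypothetical per-activity mean heart rates learned from the training set.
activity_mean_hr = {"lying": 60.0, "walking": 95.0, "running": 150.0}

def predict_activity(mean_hr):
    """Assign a window to the activity whose mean heart rate is closest."""
    return min(activity_mean_hr, key=lambda a: abs(activity_mean_hr[a] - mean_hr))
```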

Train

Test

Solid Benchmark

As a model that gives us a more solid baseline we use a linear regression model.

Train

Prediction vs True Labels for subject 103 as validation group

Test

We found that the split that gave the best results was when subject 103 was in the validation group, so we fit a regression model again using the training group that did not include subject 103.

NN Model - LSTM

In this part we create an LSTM model. The input to this model is a matrix built from 200 time points (2 seconds), with a stride of 30 time points (0.3 seconds) between windows, and at each time point we feed in 40 different features - the input channels.
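The windowing that produces these input matrices can be sketched as follows (the random array stands in for the real feature matrix):

```python
import numpy as np

def make_windows(X, window=200, stride=30):
    """Slice a (T, channels) array into overlapping (window, channels) matrices."""
    return np.stack([X[i:i + window]
                     for i in range(0, len(X) - window + 1, stride)])

X = np.random.rand(1000, 40)   # 10 s of data, 40 channels (stand-in values)
batches = make_windows(X)      # shape: (num_windows, 200, 40)
```

Each resulting matrix is one LSTM input: 200 time steps by 40 channels, with consecutive windows shifted by 30 time points.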

Train

We can see that we have some overfitting, as well as extreme jumps in the loss and accuracy values on the validation set.

Test

It can be seen that subject 103 gave the best performance when used as the validation group, but the highest accuracy on the test set was obtained with the model trained with subject 104 as the validation group - 33.83796%.

To summarize this section: the accuracy we obtained on the test set is lower than that of the linear regression model, and during training our accuracy is also lower than the performance of the regression model we trained.

Pretrain our NN Model

As described among the tasks in question 1, we will try the first task we suggested - pre-training the model to predict a particular feature from the other features, so that we obtain good starting weights and improve our model.

We will try to predict the temperature measured on the subject's hand at each time point from the other features, using the same model we built in the previous section.

Predict feature using other features

We want to make a prediction for each time point based on 3 features - these determine the dimensions of the matrix we feed into the LSTM model we built.

It can be seen that within the number of epochs we ran we were unable to reduce the loss value.

Train The Model Again

Train

It can be seen that for most of the splits the accuracy values increase after the pre-training we performed, but in terms of loss we got a negative improvement, meaning our loss values increased - probably the task does not fit the model, and we need to perform a different task or use more features to predict the hand temperature.

Test

Here too we can see that the highest accuracy is obtained when we predict with the model trained with subject 106 as the validation group, and the accuracy is higher than before the task, so the self-supervised task we performed was helpful; but we still suffer from overfitting and still get higher loss values than on the train set.

How To Improve The NN Model?

Several ways we will try to improve our model:

Improvement 1 - Perform Another Self-Supervised Task

We will try to predict the acceleration value on the X-axis of the chest IMU sensor from the 10 preceding time points, advancing 10 time points forward between examples.
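The example construction for this task can be sketched as follows (the ramp series is a stand-in for the real chest-IMU X-axis acceleration):

```python
import numpy as np

def past_window_pairs(series, lookback=10, step=10):
    """Build (input, target) pairs: `lookback` past points predict the next
    value, advancing `step` points between examples."""
    X, y = [], []
    for start in range(0, len(series) - lookback, step):
        X.append(series[start:start + lookback])
        y.append(series[start + lookback])
    return np.array(X), np.array(y)

series = np.arange(100, dtype=float)   # stand-in acceleration signal
X, y = past_window_pairs(series)       # X: (examples, 10), y: (examples,)
```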

It can be seen that, compared with the self-supervised task we performed earlier, we obtained lower loss values.

Try Again on Train Set

Continuing with a different subject as validation due to the RAM limit

Now we generally see an increase in loss values and a small increase in accuracy values, but here too we still suffer from overfitting and extreme jumps between epochs.

Try Again on Test Set

Unlike the previous tasks, when predicting the test set we used the trained model that gave us the best performance in terms of both accuracy and loss, and indeed we got a high percentage of accuracy on the test set; here too there is improvement - we can conclude that the self-supervised task we performed did contribute to the learning and to the initial weights.

Improvement 2 - Add Deeper Layers And Increase Dropout

Train

Continuing with subjects 105 and 106 here because of RAM limitations

Test

Improvement 3 - Reducing Dropout Values And Adding Another Dense Layer Of The Same Size

By reducing the dropout values and adding a layer of the same size, we try to keep the important features and filter out noise with the help of the two equally sized layers.
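A hedged sketch of what such an architecture might look like in Keras (the layer sizes, dropout rate and optimizer below are assumptions, not the report's actual configuration; only the input shape of 200 time points by 40 channels and the idea of two same-size dense layers follow the text):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(200, 40)),           # 2 s windows, 40 channels
    layers.LSTM(64),
    layers.Dropout(0.2),                     # reduced dropout (assumed rate)
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),     # second dense layer, same size
    layers.Dense(18, activation="softmax"),  # one class per activity
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```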

Train

Test


We can see that we managed to improve on the test set, and in some of the training runs we managed to deal with the overfitting, but there are still extreme jumps between epochs; we still cannot lower the loss values, and in most cases we still suffer from overfitting.

On the other hand, the example classifications show that we were able to separate activities we previously confused, mainly soccer versus running and rope jumping.

In addition, there are many activities that the models we trained cannot separate - the group of ironing, folding laundry, descending stairs and vacuum cleaning, and also the group of walking, ascending stairs and descending stairs. These activities are relatively similar in terms of the effort and the changes they cause in the body, so our model will need to be more sensitive to the small differences between them.

Summary

In this part we encountered many challenges: efficient management of the RAM allocated to us, learning and working with time segments, and building a training-validation strategy, which requires an in-depth understanding of the data so that the split is done in the most efficient and correct way.

During the work we performed several experiments with several models to understand how things work: how each action we perform and each layer we add helps us, and the importance of normalization and proper data preparation - filling in missing values with interpolation-like concepts, finding patterns, and self-supervised learning of the model. These are things we did not deal with in the previous assignment, and now we understand what each of them means and can see its impact on the model and the values we obtain.